An important first step in any data analysis is to display the main features of the data in a plot. We will now learn how to do this, using the ggplot2 package.
library(ggplot2)
Let’s also read in a few example data sets to work with. This will give us a few different things to plot.
We have already seen the birth weights data:
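# stringsAsFactors=TRUE stores text columns as factors (no longer the default since R 4.0)
bw = read.csv("data/birth_weights.csv", stringsAsFactors=TRUE)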
head(bw)
The fat data record the physical measurements of some people. Each row is one person and the columns give three measurements:
fat = read.csv("data/fat.csv")
head(fat)
The marriage data record the mean age at which men and women in the USA married. Each row represents one mean age and the other columns record:
marriage = read.csv("data/marriage.csv")
head(marriage)
The salience data record measurements from multiple subjects in a psychophysical experiment that involved looking at an object. Each row represents one attempt at a speeded reaction task. The columns record:
salience = read.csv("data/salience.csv")
head(salience)
The deaths data record the proportion of deaths among children and young people in the USA, in 1950 and in 2005. Each row represents one death count. The columns record the following variables:
(We will mostly use subsets of this data frame, to make the examples simpler.)
load("data/deaths.RData")
head(deaths)
The titanic data are adapted from data recorded by the British Board of Trade following the sinking of the Titanic in 1912. Each row represents one passenger. All the columns are factor variables, recording the following pieces of information:
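# stringsAsFactors=TRUE is needed in R 4.0 and later for the columns to be factors
tt = read.csv("data/titanic.csv", stringsAsFactors=TRUE)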
head(tt)
Finally, the wine data record various characteristics of several wines (too many to be worth listing here), rated by expert wine tasters.
wine = read.csv("data/wine.csv")
head(wine)
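The gg in ggplot stands for grammar of graphics, a particular approach to describing plots of data. The core idea is to describe all different kinds of plots in terms of a few basic components: the data, the aesthetic mappings (which variables are shown on which dimensions of the plot), and the geometric objects, or geoms (the shapes that are drawn to represent the data).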
There are various other plot components implemented in ggplot, but we won’t always need to use all of them. These three are the important ones.
One of the advantages of this system is that we do not need a separate R command for every kind of plot that we want to create. Instead, we can build almost any plot we want out of a limited set of components. So there are no functions called, for example, scatterplot() or barchart() in ggplot. Instead, everything begins with just one function, ggplot(), which is used to specify the first two ingredients of the plot: the data and the aesthetic mappings.
The first input to ggplot() is the data frame containing the data we want to plot. The second input is the result of another function, aes(). aes is short for aesthetic, and this function organizes the aesthetic mappings. Inside the aes() function, we assign variables from the data set to dimensions of the plot. We do this using =, in the same way that we give named inputs to any other function.
So to display the babies’ birth weights along the y dimension and the mothers’ weights along the x dimension:
ggplot(bw, aes(y=Birth_weight, x=Weight))
In this first extremely minimal example, we didn’t yet add the third component mentioned above: a geometric object. So the plot does not yet show the data in any form. But notice that ggplot already does some useful automatic jobs. It has created the x and y scales with the necessary range, labeled them with the names of the variables that we mapped to them, and added gridlines to the plot background.
Now we will do it again with points as our geom, to produce a plot that actually shows the data. Geoms are added on to the core plot definition using +. Each geom has its own function, and these functions all begin with geom_. The function that we want for this example is called geom_point(). Because we have already defined the organization of the plot with aes(), geom_point() doesn’t need any input telling it what to plot or where.
ggplot(bw, aes(y=Birth_weight, x=Weight)) + geom_point()
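The three components can be seen together in a small self-contained sketch. It uses a made-up data frame (toy, with the hypothetical columns mother and baby) instead of the birth weights file, so that it can be run on its own.

```r
library(ggplot2)

# A small made-up data frame, standing in for the birth weights data.
toy = data.frame(
  mother = c(50, 55, 61, 68, 74, 80),
  baby   = c(2.6, 2.9, 3.1, 3.0, 3.4, 3.6)
)

# Component 1: the data. Component 2: the aesthetic mappings.
base_plot = ggplot(toy, aes(y=baby, x=mother))

# Component 3: a geom, added on with +.
point_plot = base_plot + geom_point()
print(point_plot)
```

Printing base_plot instead of point_plot shows the empty scales and gridlines without any data.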
If we use any other dimensions when defining the plot, such as color, size, or shape, any geom that is able to represent that dimension will take this into account. For example, points will be shown in different colors if a variable is mapped to the color dimension. A legend for the colors is added automatically.
ggplot(bw, aes(y=Birth_weight, x=Weight, color=Smoker)) + geom_point()
We will often want to have a few different versions of our plot, for example a basic version showing the individual observations in the data, perhaps a version showing just mean values, a version that splits the data into two subgroups, and so on. It would be a bit tedious to have to repeat the basic underlying plot definition for each version, especially once our plot becomes quite complex. We should avoid this unnecessary repetition. The first step in doing so is to store the most basic form of our plot, so that we can use it repeatedly. We can store a plot by assigning it into a variable using =, just as we do for storing other types of information.
fig1 = ggplot(bw, aes(y=Birth_weight, x=Weight)) + geom_point()
Now the plot is stored and we can re-use it under this name. One thing we can do with a stored plot is display it again. This is done with the same print() function used for printing out the contents of variables.
print(fig1)
We can add new components to stored plots with +, and assign the result into a new variable in order to have a different version of the plot. For example, we can add a new aesthetic mapping with aes().
fig2 = fig1 + aes(color=Smoker)
print(fig2)
We can change an existing aesthetic mapping in the same way.
fig2 = fig1 + aes(x=Age)
print(fig2)
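As a quick check that adding aes() to a stored plot creates a new version and leaves the stored plot untouched, here is a sketch with made-up data. The data frame toy and its column names are hypothetical; layer_data() is a ggplot2 helper that returns the values a layer actually draws.

```r
library(ggplot2)

# Made-up data; the column names are hypothetical.
toy = data.frame(
  weight = c(50, 55, 61, 68, 74, 80),
  age    = c(19, 31, 24, 28, 35, 22),
  baby   = c(2.6, 2.9, 3.1, 3.0, 3.4, 3.6)
)

# Store the basic plot once.
fig_weight = ggplot(toy, aes(y=baby, x=weight)) + geom_point()

# Adding aes() to the stored plot replaces the x mapping in the new version.
# fig_weight itself is left unchanged.
fig_age = fig_weight + aes(x=age)

# Compare the x positions drawn by each version.
layer_data(fig_weight)$x
layer_data(fig_age)$x
```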
As well as new mappings, we can add new geoms to an existing plot. For example, a common accompaniment to geom_point() is a trend line showing a smooth relationship between the x and y variables. We can add this with geom_smooth().
fig2 = fig1 + geom_smooth()
print(fig2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Notice that we also got a short message printed out when we ran geom_smooth(). This is neither an error nor a warning, but simply an informational message. Some functions print out a reminder of what they are doing when we run them, so that we can check that it is really what we wanted. This occurs most often in cases where the function’s default behavior might not be what every user wants.
The content of the message tells us something about a ‘method’ called ‘loess’, which geom_smooth() has used. loess stands for locally estimated scatterplot smoothing. Broadly speaking, it estimates the typical value of the y variable locally for each region of the x scale.
This default option is more complex than we will usually need. To change the behaviors of plotting functions, we must give them some input specifying what aspect of their behavior we want to change and what we want to change it to. For example, we can change the method for geom_smooth() to lm, which stands for linear model. This will show a straight line relationship between the two variables.
fig2 = fig1 + geom_smooth(method=lm)
print(fig2)
Most geom_ functions have lots and lots of options that we can change in order to fine-tune our plot. For example, we can turn off the margin of error region that geom_smooth() draws by default, by setting the se argument to FALSE.
fig2 = fig1 + geom_smooth(method=lm, se=FALSE)
print(fig2)
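To see what method=lm and se=FALSE do, the following sketch compares the line drawn by geom_smooth() with an ordinary linear model fitted by lm(). The data frame toy is made up for illustration.

```r
library(ggplot2)

# Made-up data with a roughly linear trend.
toy = data.frame(
  x = 1:10,
  y = c(2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1, 9.9, 11.2)
)

# method=lm asks for a straight line; se=FALSE turns off the shaded region.
line_plot = ggplot(toy, aes(y=y, x=x)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE)
print(line_plot)

# The intercept and slope of the same straight line, fitted directly.
fit = lm(y ~ x, data=toy)
coef(fit)
```

The smooth is the second layer of this plot, so layer_data(line_plot, 2) returns the x and y values along the drawn line, should we want to inspect them.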
(When writing a full data analysis, we will almost always assign plots into variables and then add to these plots to produce variations on the plot, as in the examples above. However, in most of the examples in the rest of this tutorial, I have repeated the commands for each new plot in full, so that it is easier to see all of the components of which each example plot consists.)
The order in which we add geoms to the plot matters. They will be drawn in the order that we added them. For example, if we create the plot above by first adding the smooth line and then the points, it will look slightly different. The difference in this case is fairly subtle, but if you look carefully you will notice that some of the points are now drawn over the line.
ggplot(bw, aes(y=Birth_weight, x=Weight, color=Smoker)) + geom_smooth(method=lm) + geom_point()
When we create a plot with multiple components, we can make our R commands a little neater by putting each component of the plot on a new line after the initial plot definition. We can continue commands on a new line after the + symbol. This doesn’t change anything about the plot, but it makes our R commands easier for other people to read and understand.
ggplot(bw, aes(y=Birth_weight, x=Weight, color=Smoker)) +
  geom_smooth(method=lm) +
  geom_point()
As noted above, the aesthetic mappings of a plot determine which dimensions of the plot are used to represent which variables in the data frame. Any geom that is able to reflect an aesthetic mapping will do so.
We have already encountered the x and y aesthetics in the examples above. x and y are used in almost all plots, since they provide the basic 2-dimensional space of the plot.
In our example above with babies’ birth weights and their mothers’ weight or age, both x and y variables were numeric. This does not have to be the case. In particular the x dimension can also be used to display a factor variable. A common combination is to map a factor variable to the x dimension, then use boxplots as a geom.
The resulting plot compares the spread of y values for each level of the x variable, side by side. (We will see in a moment what exactly the boxplots tell us.)
ggplot(bw, aes(y=Birth_weight, x=Race)) +
geom_boxplot()
Note that the ordering of the factor levels along the x dimension reflects the order defined in the factor variable. By default this is alphabetical. If we change the ordering, it will be reflected in any new plots we create.
levels(bw$Race)
## [1] "black" "other" "white"
bw$Race = factor(bw$Race, levels=c("black", "white", "other"))
ggplot(bw, aes(y=Birth_weight, x=Race)) +
geom_boxplot()
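The reordering step does not depend on ggplot at all, so it can be tried out on a plain vector. A small sketch with a made-up vector (race is hypothetical, not the column from the birth weights file):

```r
# A made-up character vector of category labels.
race = c("white", "black", "other", "white", "black")

# Default: factor() puts the levels in alphabetical order.
race_default = factor(race)
levels(race_default)

# Explicit order: we list the levels in the order we want them.
race_ordered = factor(race, levels=c("black", "white", "other"))
levels(race_ordered)
```

Only the order of the levels changes. The observations themselves stay the same, which as.character(race_ordered) confirms.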
The color aesthetic can be used to show either a factor or a numerical variable. For a factor, it shows each level of the factor in a separate color. Any geom added to a plot with a color mapping will be shown separately in each of the colors.
ggplot(bw, aes(y=Birth_weight, x=Weight, color=Smoker)) +
geom_smooth(method=lm) +
geom_point()
ggplot distinguishes between geometric objects that do not enclose an area, such as points and lines, and those that do, such as rectangles, circles, and so on. The color aesthetic mapping determines the color of points and lines only. For shapes with an area, such as the central box of a boxplot, the color mapping will only be reflected in the outline of the shape.
ggplot(bw, aes(y=Birth_weight, x=Race, color=Smoker)) +
geom_boxplot()
The fill aesthetic mapping determines the color of the inside area of shapes. So if we would like to fill the whole area of an object with color, for example a boxplot, then we need to use fill= in the plot definition rather than color=.
ggplot(bw, aes(y=Birth_weight, x=Race, fill=Smoker)) +
geom_boxplot()
With their default shape, points have no area to fill, and neither do lines, so they are unaffected by the fill aesthetic. The shaded margin of error around a smooth trend line does have an area, so it will reflect the fill mapping.
ggplot(bw, aes(y=Birth_weight, x=Weight, fill=Smoker)) +
geom_smooth(method=lm) +
geom_point()
We can map a variable to more than one dimension. It is fairly common to do this for the color and fill mappings, since they both control color but for different kinds of objects or different parts of an object.
ggplot(bw, aes(y=Birth_weight, x=Weight, color=Smoker, fill=Smoker)) +
geom_smooth(method=lm) +
geom_point()
Both color and fill can also be used with numeric variables. In this case, the color scale is a continuous gradient that gradually changes from one color to another along the scale of the numeric variable.
ggplot(fat, aes(y=Waist, x=Weight, color=Fat)) +
geom_point()
The size aesthetic determines the size of geoms. This is most useful in combination with points.
ggplot(bw, aes(y=Birth_weight, x=Age, size=Weight)) +
geom_point()
But if we have more than just a few observations in our data, it can be difficult to see the variation in point size. The size aesthetic is best used where we have only a small number of observations or where they are far apart from each other on the plot. For example in the fat data set, which is fairly small, the size aesthetic allows us to display all three numeric variables on one plot to show the trend towards increasing proportions of body fat as waist size and weight increase.
ggplot(fat, aes(y=Waist, x=Weight, size=Fat)) +
geom_point()
Because size varies continuously from small to large, it is best used to represent a numeric variable, and not the levels of a factor. If we show factor levels as different sizes, we give the misleading impression that they form an ordered progression from least to greatest. ggplot warns us if we make this choice.
ggplot(bw, aes(y=Birth_weight, x=Weight, size=Race)) +
geom_point()
## Warning: Using size for a discrete variable is not advised.
The linetype aesthetic draws different kinds of lines for each of the levels of a factor variable: solid, dashed, dotted etc.
ggplot(marriage, aes(y=Marriage_age, x=Year, linetype=Sex)) +
geom_line()
Linetype affects not only the line geom but any lines drawn on the plot, for example the smooth trend lines drawn by geom_smooth().
ggplot(bw, aes(y=Birth_weight, x=Weight, linetype=Smoker)) +
geom_smooth(method=lm, se=FALSE) +
geom_point()
It even affects the lines drawn for the outlines of boxplots. But this is not such a clear way of displaying separate boxplots. The fill aesthetic is better for this.
ggplot(bw, aes(y=Birth_weight, x=Race, linetype=Smoker)) +
geom_boxplot()
Numeric variables cannot be mapped to linetype, because there is no way for the type of line to vary continuously along a scale. If we try to do so then the result is an error.
ggplot(bw, aes(y=Birth_weight, x=Weight, linetype=Visits)) +
geom_smooth(method=lm, se=FALSE) +
geom_point()
## Error: A continuous variable can not be mapped to linetype
If we have a numeric variable that has only a few possible values, and we would like to show different line types for each of these, then we must first convert the numeric variable to a factor using factor(). To avoid changing the original numeric variable or creating a new one, we can apply factor() directly within the plot definition.
But with more than a few factor levels, the differences between line types become hard to distinguish. Linetype is best for just two or three levels.
ggplot(bw, aes(y=Birth_weight, x=Weight, linetype=factor(Visits))) +
geom_smooth(method=lm, se=FALSE) +
geom_point()
To keep a long plot definition more compact, some longer aesthetics have abbreviated names. linetype can be abbreviated to lty.
ggplot(marriage, aes(y=Marriage_age, x=Year, lty=Sex)) +
geom_line()
The shape aesthetic determines what symbol is drawn for points. Like linetype, it works only for a factor variable. Shape is often not such a useful aesthetic. It works best together with linetype, for showing the points along a progression.
ggplot(marriage, aes(y=Marriage_age, x=Year, linetype=Sex, shape=Sex)) +
geom_line() +
geom_point()
With more than just two levels, shapes quickly become very difficult to distinguish on a crowded plot.
ggplot(bw, aes(y=Birth_weight, x=Weight, shape=Race)) +
geom_point()
The group aesthetic is a special one. Unlike the other aesthetic mappings, group does not cause the appearance of geoms to vary with the values of a variable. However, it does still display geoms separately for each of the values of a variable. So the result of mapping a variable to group is that we see separate geoms for each of the values of the variable, but those geoms all have the same color, line type, etc.
One common use of the group aesthetic is to show separate lines for each subject when we have recorded data from multiple subjects. The separate lines give an idea of how consistent any trend is across the various subjects.
ggplot(salience, aes(y=Error, x=RT, group=Subject)) +
geom_point() +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Different geometric objects are useful for showing different types of data and different features of those data. In addition, each geom is affected by some aesthetic mappings and not by others. So when choosing geoms for our plot we need to think carefully about which will show the important information clearly, and which may obscure it or mislead us.
The ? help for each geom_ function often gives some guidance on the appropriate use of the geom. It also gives a list of the aesthetics that the geom understands (under the heading Aesthetics). These help pages are linked in the sections below.
The point is a very useful geom because it can show every individual observation, and therefore does not discard any information. We have seen points used in several examples above. The main disadvantage to points is ‘overplotting’; if we have a large number of observations then the result may look like a solid cloud and any overall trend may be lost.
ggplot(salience, aes(y=Error, x=RT)) +
geom_point()
A slight improvement can be made if we give each point a distinguishable outline. We can change this using arguments to the geom_point() function. Many arguments that change the appearance of a geom have the same name as the aesthetics that control those aspects of appearance. By specifying them in the geom function we just set them to a fixed value instead of mapping them to a variable. To give points distinct outlines, we can choose a shape that allows different colors for its outline and for its interior, such as a circle, and then fill the interior with a different color from the default point color (which as we can see from the example above is black).
ggplot(salience, aes(y=Error, x=RT)) +
geom_point(shape="circle filled", fill="grey")
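The difference between mapping an aesthetic to a variable and setting it to a fixed value can be sketched as follows. The data frame toy and its columns are made up for illustration.

```r
library(ggplot2)

# Made-up data; the column names are hypothetical.
toy = data.frame(
  x     = c(1, 2, 3, 4),
  y     = c(2, 4, 3, 5),
  group = c("a", "a", "b", "b")
)

# Mapped: fill varies with a variable, so it goes inside aes().
mapped_plot = ggplot(toy, aes(y=y, x=x, fill=group)) +
  geom_point(shape="circle filled", size=4)

# Fixed: fill is the same for every point, so it is an argument to the geom.
fixed_plot = ggplot(toy, aes(y=y, x=x)) +
  geom_point(shape="circle filled", size=4, fill="grey")

print(mapped_plot)
print(fixed_plot)
```

The mapped version gets a legend, because the colors carry information about the data. The fixed version does not.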
This can also make smaller plots clearer, particularly if we have also caused some points to overlap by varying their size.
ggplot(fat, aes(y=Waist, x=Weight, size=Fat)) +
geom_point(shape="circle filled", fill="grey")
Another option for making the overall trend clearer when we have a large cloud of points is to add a smooth trend line with geom_smooth(), as we have seen above.
Although points are a natural choice of geom for numeric x scales, we can also use them with a factor variable mapped to the x dimension. The result is to show a spread of points for each level of the factor.
ggplot(bw, aes(y=Birth_weight, x=Smoker)) +
geom_point()
But this does not give a very clear impression of the data because so many points overlap. We can improve this by jittering the points: spreading them out randomly to either side. The position argument for geom_ functions allows us to fine-tune their positioning. The input to this argument is a ggplot position_ function. position_jitter() applies jittering to points. The arguments to position_jitter() are in turn width and height. These determine the maximum horizontal and vertical distance that the points will be jittered. The units of width and height are the units of the variables that we have mapped to the x and y dimensions of the plot. If we have mapped a factor variable, then the scale is such that the distance between two neighboring categories equals 1. Therefore, to avoid points jittering over into the wrong category, we should keep the jitter for the factor axis well below 0.5.
In our current example, we do not want any vertical jitter, as this would alter the apparent birth weights of the babies, which is misleading. We want only a bit of horizontal jitter.
ggplot(bw, aes(y=Birth_weight, x=Smoker)) +
geom_point(position=position_jitter(width=0.1, height=0))
Because jittered points are fairly commonly used, a convenience function geom_jitter() is provided that combines geom_point() with position_jitter(). This gives the same result.
ggplot(bw, aes(y=Birth_weight, x=Smoker)) +
geom_jitter(width=0.1, height=0)
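The effect of the width and height arguments can be checked by looking at the positions that are actually drawn. This sketch uses a made-up data frame (toy) and additionally sets the optional seed argument of position_jitter(), which is not used in the examples above, so that the random jitter is the same on every run.

```r
library(ggplot2)

# Made-up data with several identical y values that would overlap.
toy = data.frame(
  smoker = c("no", "no", "no", "yes", "yes", "yes"),
  weight = c(3.2, 3.2, 3.5, 2.9, 2.9, 3.1)
)

# width=0.1 moves points at most 0.1 left or right; height=0 leaves y alone.
jitter_plot = ggplot(toy, aes(y=weight, x=smoker)) +
  geom_point(position=position_jitter(width=0.1, height=0, seed=1))
print(jitter_plot)

# The categories sit at x = 1 and x = 2; jittered x values stay close to these.
drawn = layer_data(jitter_plot)
drawn[, c("x", "y")]
```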
Lines are good for showing a progression along a numeric scale. For this, we need a numeric variable mapped to the x dimension, and not more than one observation at each of the values along the x dimension. These criteria are fulfilled in the marriage data, for example, as we saw in the line plot we created above.
If we have more than one observation at each of the values along the x dimension, a line will join them in an arbitrary order, giving an unclear impression of the data.
ggplot(bw, aes(y=Birth_weight, x=Age)) +
geom_line()
We can also use a line to illustrate a progression across levels of a factor variable. However, this only makes sense if the levels of the variable have a meaningful order. This is the case for example in the deaths data, where the age groups are recorded in a factor variable.
levels(deaths$Age)
## [1] "0 to 1" "1 to 4" "5 to 14" "15 to 24"
If we add a line to a plot with a factor variable mapped to the x axis, the result is unfortunately not automatically a line showing the progression across levels.
accidents_1950 = subset(deaths, Cause=="accidents" & Year==1950)
ggplot(accidents_1950, aes(y=Deaths, x=Age)) +
geom_line()
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
The message mentions that ‘each group consists of only one observation’. What does this mean? The problem is that ggplot’s default behavior is to display geoms separately for the different levels of a factor variable wherever possible. But when each level of the factor contains only one observation, there is nothing for a line to join within each level.
The message also hints at a solution: override the default grouping using the group aesthetic. To instruct ggplot not to separate geoms into groups at all, we set the group aesthetic to 1 (meaning ‘put all observations together in just 1 group’). This allows a line to join up observations that are in different categories.
ggplot(accidents_1950, aes(y=Deaths, x=Age, group=1)) +
geom_line()
If there are other grouping variables for which we would still like to display separate lines, we should map these variables to the group aesthetic.
accidents = subset(deaths, Cause=="accidents")
ggplot(accidents, aes(y=Deaths, x=Age, linetype=factor(Year), group=Year)) +
geom_line()
This use of the group aesthetic can be a bit tricky, and sometimes requires some trial and error before producing the plot that we want.
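Since this usually takes some experimenting, it can help to try the grouping on a tiny data set first and to check how many groups ggplot ends up with. The following is a sketch with made-up numbers. The data frame toy and its columns are hypothetical and only imitate the layout of the deaths data.

```r
library(ggplot2)

# Made-up data: four ordered age groups, recorded in two years.
ages = c("0 to 1", "1 to 4", "5 to 14", "15 to 24")
toy = data.frame(
  age_group = factor(rep(ages, times=2), levels=ages),
  year      = rep(c(1950, 2005), each=4),
  deaths    = c(12, 30, 18, 45, 5, 11, 7, 20)
)

# group=1: all observations in one group, so a single line joins them.
one_line = ggplot(subset(toy, year == 1950), aes(y=deaths, x=age_group, group=1)) +
  geom_line()

# group=year: one line for each year.
two_lines = ggplot(toy, aes(y=deaths, x=age_group, linetype=factor(year), group=year)) +
  geom_line()

print(one_line)
print(two_lines)

# The group column shows which line each observation belongs to.
unique(layer_data(two_lines)$group)
```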
One thing that we should definitely not use lines for is to join up factor levels that do not have a meaningful order. This creates the impression of a meaningful progression where the order of the factor levels is in fact arbitrary.
For example the Cause variable in the deaths data does not have a meaningful order:
deaths_young_2005 = subset(deaths, Age=="0 to 1" & Year==2005)
ggplot(deaths_young_2005, aes(y=Deaths, x=Cause, group=1)) +
geom_line()
A column (or bar) varies its height according to the value of the y variable. In order for the height of the column to accurately reflect the value of the y variable, the y variable should have a meaningful zero point, i.e. a value of zero should indicate that there is none of whatever property the variable measures. This is most commonly the case for variables that count up how many instances of an event have occurred; a value of zero indicates that the event did not occur at all.
geom_col() adds columns to a plot.
ggplot(deaths_young_2005, aes(y=Deaths, x=Cause)) +
geom_col()
A common use of columns is simply to count up how many observations we have. The geom_bar() function does this. It maps the count of observations to the y dimension, without us having to specify a y mapping.
ggplot(tt, aes(x=Class)) +
geom_bar()
Behind the scenes, geom_bar() calculates the number of observations in each category of the factor variable that we have mapped to the x dimension. If we want, we can access these calculated variables in our plot definition. They are enclosed in .. .. to distinguish them from the original variables in our data frame. So the count of observations is referred to as ..count.. (with two dots on each side). If we map this variable to the y dimension, the result is the same as the default behavior of geom_bar().
ggplot(tt, aes(y=..count.., x=Class)) +
geom_bar()
Some geoms calculate more than one new variable. geom_bar() also calculates the proportion of observations in each category. We can map this to the y dimension instead.
ggplot(tt, aes(y=..prop.., x=Class)) +
geom_bar()
However, as we can see above, this does not get us the number of observations in each category as a proportion of the total number of observations. The cause of the problem is the same as that we encountered above for geom_line() with a factor variable: ggplot by default calculates and displays everything separately for each level of the factor. The number of observations in a category as a proportion of the number of observations in that same category is always 1, and therefore not very informative. The solution is the same as we saw earlier. We need to use the group aesthetic to tell geom_bar() not to calculate proportions in isolation within each category, but with respect to the entire ungrouped set of data.
ggplot(tt, aes(y=..prop.., x=Class, group=1)) +
geom_bar()
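Newer versions of ggplot2 also provide the after_stat() function as another way of referring to calculated variables, and recent versions print a deprecation warning when the .. .. notation is used. The sketch below shows the same proportion plot written with after_stat(), using a made-up data frame (toy, with a hypothetical class column).

```r
library(ggplot2)

# Made-up data: eight people in four classes.
toy = data.frame(
  class = c("1st", "1st", "2nd", "3rd", "3rd", "3rd", "3rd", "crew")
)

# after_stat(prop) refers to the proportion calculated by geom_bar().
# group=1 makes the proportions relative to the whole data set.
prop_plot = ggplot(toy, aes(y=after_stat(prop), x=class, group=1)) +
  geom_bar()
print(prop_plot)

# The calculated variables can be inspected directly.
layer_data(prop_plot)[, c("x", "count", "prop")]
```

With group=1 the proportions add up to 1 across the categories, which is a useful sanity check on the grouping.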
geom_bar() is reserved specially for counts and proportions. It does not work if we map a variable from our data frame to the y dimension, and we will get an error if we try. If we want bars for a variable from our data frame, then we need geom_col() instead, as we saw above.
ggplot(deaths_young_2005, aes(y=Deaths, x=Cause)) +
geom_bar()
## Error: stat_count() must not be used with a y aesthetic.
Columns and bars can be filled with color, so we can map a second factor variable to the fill dimension to show combinations of the levels of two factors.
ggplot(tt, aes(x=Class, fill=Status)) +
geom_bar()
By default, bars in different colors are positioned stacked on top of each other. It is often clearer to show them side by side if we want to compare their heights. The position argument can achieve this. The term for placing objects side by side is dodging, and there is a positioning function for this: position_dodge().
ggplot(tt, aes(x=Class, fill=Status)) +
geom_bar(position=position_dodge())
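The difference between stacking and dodging shows up in where each bar starts. A small sketch with made-up data (the data frame toy and its columns are hypothetical):

```r
library(ggplot2)

# Made-up data: class and survival status of eight people.
toy = data.frame(
  class  = c("1st", "1st", "1st", "2nd", "2nd", "3rd", "3rd", "3rd"),
  status = c("survived", "survived", "died", "survived",
             "died", "died", "died", "survived")
)

# Default position: bars of different colors are stacked.
stacked = ggplot(toy, aes(x=class, fill=status)) +
  geom_bar()

# Dodged: bars of different colors are placed side by side.
dodged = ggplot(toy, aes(x=class, fill=status)) +
  geom_bar(position=position_dodge())

# ymin is where each bar starts. Stacked bars can start above zero,
# whereas dodged bars all start at zero.
layer_data(stacked)[, c("x", "ymin", "ymax")]
layer_data(dodged)[, c("x", "ymin", "ymax")]
```

Because dodged bars share a common baseline, their heights can be compared directly.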
Remember that the color aesthetic controls the color of outlines for filled shapes. This is rarely what we want for bars, as it is not easy to see.
ggplot(tt, aes(x=Class, color=Status)) +
geom_bar(position=position_dodge())
Bars are not so good for numeric x variables with many different possible values or where y values are very far from zero. Lines are clearer for this.
ggplot(marriage, aes(y=Marriage_age, x=Year, fill=Sex)) +
geom_col(position=position_dodge())
We have seen that points are good for showing the full details of our data, as they can show every individual observation. But often this will be too much if we have a large number of observations. A subtle overall trend may be obscured in a cloud of points.
ggplot(bw, aes(y=Birth_weight, x=Smoker)) +
geom_jitter(width=0.2, height=0)
If we wish to compare the levels of a factor variable side by side, boxplots provide a good compromise between detail and summary. They compress the individual observations into a summary based on a few numbers.
ggplot(bw, aes(y=Birth_weight, x=Smoker)) +
geom_boxplot()
What numbers do the boxplots show? Some of these numbers we will explore in more detail in a later tutorial. For now we will look at a brief explanation of each. The summary numbers can be broken down into three components. First the ‘box’ at the center of each boxplot:
The result is that the box shows the range of the central half of the observations, and therefore gives an indication of where along the y scale most of the observations are located.
Then the ‘whiskers’ that extend outwards from the box. Their definition is somewhat convoluted:
Finally, the individual points:
Note that it is possible that there are no outliers. This is the case for example among the birth weights for babies of the non-smoking mothers in the plot above.
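The three components can be calculated by hand for a small set of numbers. The sketch below assumes the default settings of geom_boxplot(): quartiles for the edges of the box, and whiskers limited to 1.5 times the height of the box. The vector y is made up so that it contains exactly one outlier.

```r
# Made-up observations, chosen so that one value is an outlier.
y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# The box: lower quartile, median, and upper quartile.
box = quantile(y, c(0.25, 0.5, 0.75))

# The whiskers reach to the most extreme observations that lie
# within 1.5 times the box height of the box edges.
box_height = box[["75%"]] - box[["25%"]]
lower_limit = box[["25%"]] - 1.5 * box_height
upper_limit = box[["75%"]] + 1.5 * box_height
inside = y[y >= lower_limit & y <= upper_limit]
whiskers = range(inside)

# Anything beyond the limits is drawn as an individual point.
outliers = y[y < lower_limit | y > upper_limit]

box
whiskers
outliers
```

Here the box runs from 3.25 to 7.75 with the median at 5.5, the whiskers reach 1 and 9, and the value 30 is shown as an individual point.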
If we would like to have both the compact summary provided by the boxplots and the more detailed view of the individual observations provided by points, we can. We can just add points after the boxplots.
ggplot(bw, aes(y=Birth_weight, x=Smoker)) +
geom_boxplot() +
geom_jitter(width=0.1, height=0)
However, if you look carefully at the plot above you will notice a small detail that makes the plot very slightly misleading. The outlier shown by the boxplot as an individual point is drawn again by geom_jitter(), making it look as though there are two such extreme observations. To prevent this from occurring, we can use the outlier.shape argument for geom_boxplot() to turn off the display of outliers. This argument determines what shape will be used to display the outliers, and we can set this shape to be an empty piece of text ("").
ggplot(bw, aes(y=Birth_weight, x=Smoker)) +
geom_boxplot(outlier.shape="") +
geom_jitter(width=0.1, height=0)
Remember that the order in which geoms are added makes a difference to the layering of the objects on the plot.
ggplot(bw, aes(y=Birth_weight, x=Smoker)) +
geom_jitter(width=0.1, height=0) +
geom_boxplot(outlier.shape="")
If we want to use boxplots to compare combinations of the levels of two factor variables, we can map one of them to the fill dimension. The differently-colored boxplots are automatically aligned side-by-side. When deciding which factor variable to map to the x dimension and which to the fill dimension, we should usually choose the most important factor for the fill dimension. This ensures that the different levels of this factor will be displayed immediately next to each other within each level of the other factor. Which factor variable is more important depends on what we want to know from the data.
For example, if we are most interested in the relationship between smoking and birth weight in the birth weights data, we should map the smoking variable to the fill dimension when we are also including another factor variable.
ggplot(bw, aes(y=Birth_weight, x=Race, fill=Smoker)) +
geom_boxplot()
What if we also want to add points to a more complex collection of boxplots like the one above? Unfortunately, geom_point() (and geom_jitter()) do not take the fill variable into account when positioning the points.
ggplot(bw, aes(y=Birth_weight, x=Race, fill=Smoker)) +
geom_boxplot(outlier.shape="") +
geom_jitter(width=0.1, height=0)
For this, we need to return to the position argument for geom_point(). The positioning function position_jitterdodge() jitters points but also dodges them to align with all the groupings of observations defined in the other aesthetic mappings. Because its main purpose is to align points with boxplots, position_jitterdodge() applies by default no vertical jitter and an amount of horizontal jitter that matches fairly neatly the width of the boxplots. So we do not need to specify its jitter.width and jitter.height arguments unless we really need to fine-tune the appearance of the plot.
ggplot(bw, aes(y=Birth_weight, x=Race, fill=Smoker)) +
geom_boxplot(outlier.shape="") +
geom_point(position=position_jitterdodge())
Finally, what if we also want the points to reflect the fill aesthetic that we defined for the boxplots? The simplest way to achieve this is to set a fillable shape for the points, such as a circle, as we saw above when learning about the point geom.
ggplot(bw, aes(y=Birth_weight, x=Race, fill=Smoker)) +
geom_boxplot(outlier.shape="") +
geom_point(position=position_jitterdodge(), shape="circle filled")
A smooth line summarizes the overall trend in the relationship between numeric variables mapped to the x and y dimensions. As we saw above, this can be useful for displaying a subtle trend in a large cloud of points.
We will learn more about geom_smooth() in later tutorials when we consider methods for quantifying trends in data.
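As a brief reminder of how the smooth geom is layered onto a plot of points, here is a minimal sketch. It uses the mtcars data that ship with R as a stand-in, since it does not assume the tutorial's data files are available. Choosing method="loess" and writing out the formula are assumptions made for this example; they are stated explicitly so that geom_smooth() does not have to report which defaults it picked.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the tutorial's data files here.
fig_trend = ggplot(mtcars, aes(y=mpg, x=wt)) +
  geom_point() +
  # method and formula are given explicitly, so no default-method message appears
  geom_smooth(method="loess", formula=y~x)
print(fig_trend)
```

The shaded band around the line is drawn by default; it can be switched off with se=FALSE.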
The text geom draws a text label for each observation. This requires an additional aesthetic mapping that we did not look at above: the ‘label’ aesthetic. This aesthetic has no effect on most geoms. Its effect on geom_text() is as expected: we see the value of the mapped variable printed on the plot. This is most useful when the mapped variable just contains an individual name for each observation.
For example, in the wine data, we can add the individual names of the wines to a plot of points.
ggplot(wine, aes(y=Fruity, x=Citrus, label=Name)) +
geom_point() +
geom_text()
Unfortunately, text is not automatically positioned so that all of it falls within the bounds of the plot. Pieces of text also overlap with points and sometimes with each other. This makes text difficult to get right. We can make some improvement by aligning the text differently. The hjust and vjust arguments ‘justify’ the text in the horizontal and vertical directions. For example, if we input hjust="left" and vjust="bottom", the bottom left corner of the text is placed at the xy position of the observation.
ggplot(wine, aes(y=Fruity, x=Citrus, label=Name)) +
geom_point() +
geom_text(hjust="left", vjust="bottom")
There are two special inputs to the alignment arguments. "inward" and "outward" move text towards the middle or the edges of the plot, respectively. The "inward" option can be useful for keeping text within the bounds of the plot, although sometimes at the cost of causing text to overlap.
ggplot(wine, aes(y=Fruity, x=Citrus, label=Name)) +
geom_point() +
geom_text(hjust="inward")
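Two further arguments to geom_text() can help with overlapping labels: nudge_y shifts every label by a fixed amount along the y scale, and check_overlap=TRUE leaves out any label that would overlap one already drawn. The sketch below is only an illustration. The cars data frame is a hypothetical stand-in built from the built-in mtcars data, and the nudge of 0.5 is an arbitrary choice that would need adjusting for other data.

```r
library(ggplot2)

# A stand-in data frame: car names and two measurements from the built-in mtcars data.
cars = data.frame(Name=rownames(mtcars), mpg=mtcars$mpg, wt=mtcars$wt)

fig_labels = ggplot(cars, aes(y=mpg, x=wt, label=Name)) +
  geom_point() +
  # shift labels up by 0.5 units of mpg and skip labels that would overlap earlier ones
  geom_text(vjust="bottom", nudge_y=0.5, check_overlap=TRUE)
print(fig_labels)
```

Note that check_overlap silently leaves some observations unlabelled, so it suits exploratory plots better than plots in which every label matters.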
There are several useful geoms that we can use to put additional information on a plot. Often these geoms are used without an aesthetic mapping, just to show one value that is of importance for the interpretation of the data.
For example, the hline geom adds horizontal lines across the full width of the plot. The position of the lines is determined by the yintercept argument, which specifies where on the y scale the lines should be placed vertically. We can use this to indicate an important threshold, for example zero error in the salience data.
ggplot(salience, aes(y=Error, x=RT)) +
geom_point(shape="circle filled", fill="grey") +
geom_hline(yintercept=0, lty="dashed", color="red")
We can input a vector of values instead of a single number, and the geom is drawn once for each value in the vector. This is useful for indicating a specific range of the y scale, for example the range of birth weights typically considered healthy.
ggplot(bw, aes(y=Birth_weight, x=Weight)) +
geom_point() +
geom_hline(yintercept=c(2.5,4.5), lty="dashed", color="red")
(There is of course also a geom_vline() for adding vertical lines, and its positioning argument is xintercept.)
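A minimal sketch of the vertical counterpart follows, again using the built-in mtcars data as a stand-in. The threshold values 2 and 4 are arbitrary and chosen only for illustration.

```r
library(ggplot2)

fig_vline = ggplot(mtcars, aes(y=mpg, x=wt)) +
  geom_point() +
  # one vertical line is drawn for each value in the xintercept vector
  geom_vline(xintercept=c(2,4), linetype="dashed", color="red")
print(fig_vline)
```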
In the examples above we defined the data and the aesthetic mappings for the whole plot, and then each geom that we layered onto the plot displayed these data and mappings. Most of the time, this is what we want; the whole plot should conform to the same overall organization. However, occasionally we will want to apply a particular aesthetic mapping, or even different data, to just one of the geoms on the plot. For example, we may want one geom to map a particular variable to the color dimension, but it may be clearer for a different geom to map the same variable to the fill dimension instead. Or we may want one geom to display only a subset of the data.
This is easy to achieve with ggplot. The data and aesthetic mappings can also be defined separately for a geom when we add that geom to the plot. We simply supply the data and/or mappings as input to the geom_ function. Whereas the overall plot definition with the ggplot() function takes the data as its first input and the aesthetic mappings as the second, this is the other way around for the geom_ functions. Although this can be confusing, it has a certain logic, since the data are the most important and fundamental thing for the plot as a whole (and aesthetic mappings can be added to it later), but the mappings are the most important and most frequently changed thing for geoms. So if we want a geom to have an additional aesthetic mapping, we must give the aes() function as the first input to the geom_ function.
One use of a custom mapping like this is to ensure that a variable is reflected in the fill color for filled points, but in the line color for lines.
ggplot(bw, aes(y=Birth_weight, x=Weight)) +
geom_point(aes(fill=Smoker), shape="circle filled") +
geom_smooth(aes(color=Smoker), method=lm)
An alternative way of achieving the same thing is to define all the aesthetics for the main plot, but then ‘switch off’ some of them for specific geoms. We can switch an aesthetic mapping off by assigning the value NULL.
ggplot(bw, aes(y=Birth_weight, x=Weight, color=Smoker, fill=Smoker)) +
geom_point(aes(color=NULL), shape="circle filled") +
geom_smooth(aes(fill=NULL), method=lm)
Notice that the mappings defined in the main plot function ggplot() still apply to the geoms that have their own mappings. Geoms inherit the mappings from the main plot, and then add or change any new mappings that we define just for them. (If we ever want to stop a geom from inheriting the aesthetic mappings from the main plot, we can set the argument inherit.aes=FALSE in the geom_ function, but this is rarely necessary.)
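One situation in which inherit.aes=FALSE can be handy is when the main plot groups the data by color but we want a single overall trend line. The sketch below uses the built-in mtcars data as a stand-in. Because the smooth geom inherits nothing, its x and y mappings have to be written out again.

```r
library(ggplot2)

fig_overall = ggplot(mtcars, aes(y=mpg, x=wt, color=factor(cyl))) +
  geom_point() +
  # no inherited color mapping, so the data are not split into groups for the line
  geom_smooth(aes(y=mpg, x=wt), method="lm", formula=y~x, inherit.aes=FALSE)
print(fig_overall)
```

Only the mappings are affected; the geom still uses the data frame from the main plot.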
Geoms may also display different data than those used for the main plot definition. One use of this is to use a geom to highlight only a small subset of the data. The text geom is frequently used with a subset of data, because applying it to the entire data frame usually results in a very cluttered plot.
For example, we can add the name of the least fruity wine to the plot of wines.
ggplot(wine, aes(y=Fruity, x=Citrus)) +
geom_point() +
geom_text(aes(label=Name), subset(wine, Fruity<3 & Citrus<2), hjust="inward")
Because the data are the second argument to geom_, we must assign the data argument by name (i.e. as data=) if we are not also supplying an aesthetic mapping to the geom_ function.
ggplot(wine, aes(y=Fruity, x=Citrus, label=Name)) +
geom_point() +
geom_text(data=subset(wine, Fruity<3 & Citrus<2), hjust="inward")
In the plots above, we relied on ggplot to fill in labels for the dimensions of the plot. It took these labels from the names of the variables that we mapped to the plot dimensions. Where feasible, we can just name the variables appropriately in the original data frame, and then their plot labels will be as we want them. But fairly often we will want to modify the labels on the plot. If we want labels with spaces in them, or if we want to state the units of the variable, it is very unwieldy to incorporate this into the name of the variable in the data frame. An alternative is to specify labels manually using the labs() function.
As with aes(), the arguments to the labs() function are assignments to plot dimensions. Because the labels are just literal text, and do not refer to a function or object in R, they must be given in quotation marks.
fig1 = ggplot(bw, aes(y=Birth_weight, x=Weight, color=Smoker)) +
geom_point() +
labs(y="Birth weight (kg)", x="Mother's weight (kg)")
print(fig1)
We can also use labs() to assign a caption. We don’t always need this, but one good use of the caption is to provide a brief reference for where the data come from. This can be important for plots that might be shown without their original context, because it ensures that a record of the source is built in to the plot image.
fig1 = fig1 +
labs(caption="Data: Baystate Medical Center, 1986")
print(fig1)
labs() can also assign a title at the top of the plot. Again, we won’t always need this, but one use of a title is to assign a distinguishing label to a plot that we will later use as one of several plots presented at once.
fig1 = fig1 +
labs(title="Study A")
print(fig1)
Since writing out labels can be tedious, and we don’t want to have to repeat it for multiple plots of the same data, the labels should be one of the first things we add to a basic plot definition. We can then add different geoms and mappings to this basic plot, and the labels will be reflected in each new version we create.
fig0 = ggplot(bw, aes(y=Birth_weight)) +
labs(y="Birth weight (kg)", caption="Data: Baystate Medical Center, 1986")
fig1 = fig0 +
aes(x=Weight) +
geom_point()
fig2 = fig0 +
aes(x=Smoker) +
geom_boxplot()
# ... and so on.
So far, we have mostly been creating plots that display all of the observations in our data frame. This is usually what we want to do, at least for an initial plot. Otherwise we might miss important features of our data, such as single extreme observations that warrant extra attention. But we have already encountered a few ggplot functions that calculate and display some summary statistics from the data rather than the data themselves.
geom_smooth() summarizes a smooth relationship between x and y variables.
geom_bar() counts up how many observations there are in a group, or the proportion of observations in a group.
geom_boxplot() shows a summary of the spread of the observations, using the median and the IQR, as we learned above.
There is a more general ggplot function for first summarizing the data in a few statistics, and then displaying those statistics instead of (or as well as) the data themselves. The stat function stat_summary() does this.
Let’s see what it does by default for a simple side-by-side comparison of a two-level factor variable.
fig_smoking = fig0 +
aes(x=Smoker) +
stat_summary()
print(fig_smoking)
## No summary function supplied, defaulting to `mean_se()`
As we have seen before, a function that has a specific default behavior that we might want to change will warn us what default options it is using (for example, we saw this for the smoothing method of geom_smooth() earlier). stat_summary() requires a ‘summary function’: an R function that will take the data as its input and will output one or more summary statistics to be plotted. We are told here that the default summary function for stat_summary() is mean_se(). This function calculates the mean value of the y variable, along with its Standard Error (SE). We will learn more about Standard Errors in a later tutorial, but for now consider the Standard Error as giving a sort of ‘margin of error’ for estimating the true value of some statistic in the general population from which our observations were drawn.
Let’s look at the output of mean_se() when applied to the birth weights of the babies born to non-smoking mothers.
mean_se(bw$Birth_weight[bw$Smoker=="no"])
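The output comes in the form of a data frame with three columns: y (the mean), ymin (the mean minus the Standard Error), and ymax (the mean plus the Standard Error).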
Together, ymin and ymax provide a margin of error for estimating the mean value of the y variable in the general population from the values in our data. This can be a useful way of summarizing our data if the goal of our project is to conclude something about the population. It is these values that stat_summary() displays. We can see this if we compare the values we just got from mean_se() to the left half of the plot above.
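To make the calculation concrete, the values that mean_se() returns can be reproduced by hand. The sketch below uses a short vector of made-up measurements (hypothetical values, not taken from the birth weights data) and computes the Standard Error as the standard deviation divided by the square root of the number of observations.

```r
library(ggplot2)

# Hypothetical measurements standing in for one group of observations.
y = c(2.9, 3.4, 3.1, 3.8, 2.6, 3.3)

m = mean(y)
se = sd(y) / sqrt(length(y))   # Standard Error of the mean

by_hand = data.frame(y=m, ymin=m-se, ymax=m+se)
print(by_hand)
print(mean_se(y))   # should agree with the by-hand values
```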
stat_summary() takes a summary function and applies it separately to each grouping that we defined for the plot. Because we mapped the smoking factor variable to the x dimension, we see the mean ± the Standard Error displayed for each level of this factor separately.
We can supply a summary function to the fun.data argument for stat_summary(). If we supply the mean_se() function, we get the same plot as by default.
fig_smoking = fig0 +
aes(x=Smoker) +
stat_summary(fun.data=mean_se)
There are alternative summary functions. For example, mean_sdl() calculates the mean value of the y variable ± 2 Standard Deviations (SDs). Unlike the Standard Error, the Standard Deviation is not concerned with estimating anything in the population in general, and is just a description of how spread out our observed data are along the y scale. Again, this is something that we will learn about in more detail later.
mean_sdl(bw$Birth_weight[bw$Smoker=="no"])
fig_smoking = fig0 +
aes(x=Smoker) +
stat_summary(fun.data=mean_sdl)
print(fig_smoking)
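The range of the mean ± 2 Standard Deviations can likewise be checked with base R alone. This sketch uses made-up measurements (hypothetical values) and deliberately does not call mean_sdl(), because in current versions of ggplot2 that function appears to rely on the Hmisc package, which may need to be installed separately.

```r
# Hypothetical measurements standing in for one group of observations.
y = c(2.9, 3.4, 3.1, 3.8, 2.6, 3.3)

m = mean(y)
s = sd(y)   # Standard Deviation: the spread of the observed values

sd_range = data.frame(y=m, ymin=m-2*s, ymax=m+2*s)
print(sd_range)
```

Because the Standard Deviation is not divided by the square root of the number of observations, this range is wider than the corresponding mean ± Standard Error range.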
For simpler summaries that summarize the data in a single number, we can use a summary function that calculates a single value. We have seen some such functions already in an earlier tutorial, such as mean() and median(). Because the fun.data argument expects a function that outputs a range with a ymin and a ymax, we must use a different argument for single-number summaries: fun (this argument was called fun.y in versions of ggplot2 before 3.3.0, and the old name is now deprecated). We must also tell stat_summary() which geom we want to use to display the summary, since the default is to use a point with lines (geom_pointrange). We can supply the name of the geom to the geom argument.
fig_smoking = fig0 +
aes(x=Smoker) +
stat_summary(fun=mean, geom="point")
print(fig_smoking)
But a plot with just a single summary statistic on it is very low in information. More often we will want to use stat_summary() as an addition to displaying the individual observations. For example, we can show the individual observations as points and then add a summary pointrange on top.
fig_smoking = fig0 +
aes(x=Smoker) +
geom_jitter(width=0.1, height=0) +
stat_summary(fun.data=mean_sdl, color="red")
print(fig_smoking)
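We are not limited to the summary functions that come with ggplot. Any function that takes a vector and returns a data frame with the columns y, ymin and ymax can be supplied to fun.data. As a sketch, the function median_iqr() below is a hypothetical helper written for this example (it is not part of ggplot); it returns the median together with the lower and upper quartiles. The built-in mtcars data serve as a stand-in.

```r
library(ggplot2)

# Hypothetical summary function: the median with the interquartile range.
median_iqr = function(y) {
  q = quantile(y, c(0.25, 0.5, 0.75), names=FALSE)
  data.frame(y=q[2], ymin=q[1], ymax=q[3])
}

fig_median = ggplot(mtcars, aes(y=mpg, x=factor(cyl))) +
  geom_jitter(width=0.1, height=0) +
  # the summary function is applied separately to each level of the x variable
  stat_summary(fun.data=median_iqr, color="red")
print(fig_median)
```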
To make sure our plot can be interpreted correctly, we should always mention in our written description of the plot what statistics the points and ranges of the summary show.
In all our example plots so far, ggplot has decided for us which values to label along the axes of the plot. And it tends to make fairly good decisions about this. But sometimes we will want to take control of this aspect of our plot ourselves. For example, if we are showing values on a rating scale that uses particular numbers, such as ratings from 0 to 10, we might want to show only these numbers along the plot axis.
ggplot’s scale_ functions customize the appearance of scales. There are lots of such functions, and it can sometimes take a bit of work to find out which one we want. But generally the function will have the name of the plot dimension whose scale we wish to alter, plus some indication of what kind of scale we want to apply. For example, if we want to change the values shown for the birth weights, we need the function scale_y_continuous(), because we are dealing with the y dimension and because birth weight is a numeric variable with a continuous scale.
The breaks argument tells the scale_ function at what points on the scale we would like to label the values. The input is a vector of values. The breaks are reflected in the numbering along the axis and in the placing of gridlines in the plot background.
wine_ratings = ggplot(wine, aes(y=Overall_preference, x=Label)) +
geom_jitter(width=0.1, height=0)
wine_ratings +
scale_y_continuous(breaks=0:10)
The limits argument sets the lower and upper end of the scale. The input for this is a vector of two values (minimum then maximum).
wine_ratings +
scale_y_continuous(breaks=0:10, limits=c(0,10))
And the labels argument allows us to specify something other than numbers for the labelling of the values. The input is a vector of the same length as the breaks argument.
wine_ratings +
scale_y_continuous(breaks=c(0,5,10), labels=c("bad","medium","good"), limits=c(0,10))
To change the colors assigned to factor levels by the color and fill aesthetics, we can use scale_color_manual(). The required input is again a vector, but this time a vector of color names, one for each level of the factor, in an order corresponding to the order of the factor levels. Since these are not ‘break points’ along a scale, the argument is called values rather than breaks. In addition, we can determine which aesthetic mappings the scale applies to, using the aesthetics argument. If we input the name of a single aesthetic, for example "fill", the scale will be applied to that aesthetic, but we can also input a vector of names of aesthetics to apply the same scale to both the color and fill dimensions.
levels(factor(bw$Smoker))
## [1] "no" "yes"
fig_smoking = fig0 +
aes(x=Weight) +
scale_color_manual(values=c("skyblue","brown"), aesthetics="fill") +
geom_point(aes(fill=Smoker), shape="circle filled")
print(fig_smoking)
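If we would rather not depend on the order of the factor levels, the values vector can be given names that match the levels, and each color is then assigned to the level of the same name. The sketch below illustrates this with a stand-in data frame built from the built-in mtcars data; the Gearbox variable is created just for this example. It uses scale_fill_manual(), which is the fill counterpart of scale_color_manual().

```r
library(ggplot2)

# Stand-in data: transmission type from mtcars (0 = automatic, 1 = manual) as a factor.
cars = transform(mtcars, Gearbox=factor(am, labels=c("automatic","manual")))

fig_named = ggplot(cars, aes(y=mpg, x=wt, fill=Gearbox)) +
  geom_point(shape="circle filled") +
  # colors are matched to levels by name, so the order written here does not matter
  scale_fill_manual(values=c(manual="brown", automatic="skyblue"))
print(fig_named)
```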
And to change the continuous color gradient for a numeric variable, we can use scale_color_gradient(). The low and high arguments specify the color at each end of the scale, and a smooth gradient of color is filled in between them to represent the values along the scale.
fig_fat = ggplot(fat, aes(y=Waist, x=Weight, fill=Fat)) +
labs(y="Waist circumference (cm)", x="Weight (kg)", fill="Proportion\nbody fat") +
scale_color_gradient(low="yellow", high="red", aesthetics="fill") +
geom_point(shape="circle filled",size=3)
print(fig_fat)
(If you are wondering what named colors are available in R, you can see a full list of them by typing colors() into the console.)
There are also several packages for R that provide special color scales. For example, the viridis package provides a color scale that is more easily readable by people with the most common forms of color blindness.
library(viridis)
fig_fat + scale_color_viridis(aesthetics="fill")
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.
If we want to compare levels of a factor variable, such as smokers and non-smokers, different varieties of wine, and so on, we can map these variables to the x or color dimensions, as we did in many of the examples above. But sometimes the simplest way to compare subsets of the data is in two or more separate plots, shown side by side. Indeed, if we have already used the x and color dimensions for something else in our plot, then splitting the plot into separate panels may be our only choice if we want to show one more variable.
Separate panels showing subsets of the data are called ‘facets’. ggplot’s facet_ functions apply facets to a plot. To split a plot into separate panels, we just add a facet_ function to the plot, specifying which variable to split the data by. facet_wrap() handles the simplest case of splitting the data by just one variable. The input to facet_wrap() is a formula for splitting the data. We have not yet learned about formulas in R, and we will do so in more detail in a later tutorial. For now, just remember that a formula contains a ~ symbol, and that the variable (or variables) that accompany the ~ are used to split the data.
So to split our plot of birth weights into three panels, one for each ethnic group:
fig_smoking + facet_wrap(~Race)
One common use of facets is to check the behavior of each subject separately when we have collected data from multiple subjects. If a faceted plot has many separate panels, facet_wrap() organizes them into a grid.
fig_salience = ggplot(salience, aes(y=Error, x=RT, fill=SOA)) +
labs(x="Reaction Time (ms)") +
geom_point(shape="circle filled") +
geom_hline(yintercept=0, lty="dashed", color="red")
fig_salience + facet_wrap(~Subject)
By default, each facet has the same scale for the x and y axes. This is usually what we want, because we would like to be able to identify subjects who are in very different places along the scale. However, if we do not care so much about comparing the subjects all across the same scale, then we can allow each one to have their own scale, using the scales argument. Setting scales="free" will allow both x and y scales to differ for each facet, and setting scales="free_x" or scales="free_y" will allow only one of the scales to differ.
fig_salience + facet_wrap(~Subject, scales="free")
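For comparison, here is a minimal sketch of freeing only one of the two scales. It uses the built-in mtcars data as a stand-in, with one facet per number of cylinders.

```r
library(ggplot2)

fig_free = ggplot(mtcars, aes(y=mpg, x=wt)) +
  geom_point() +
  # the y scale is fitted to each panel separately; the x scale stays shared
  facet_wrap(~cyl, scales="free_y")
print(fig_free)
```

Keeping the x scale shared means that horizontal positions can still be compared directly across the panels.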
If we wish to show facets for the combination of levels from two factor variables, we can arrange the facets in a grid such that each row represents a level of one variable and each column represents a level of the other variable.
facet_grid() does this. The variable before the ~ in the formula is assigned to the rows, and the variable after the ~ to the columns.
fig_salience + facet_grid(Luminance~Orientation)
In the example above, the titles for each facet are not very helpful. This is because the levels of the Luminance and Orientation variables are just ‘absent’ and ‘present’. If we would like to see not only the names of the levels but also the names of the variables in the facet titles, we can set the argument labeller="label_both".
fig_salience + facet_grid(Luminance~Orientation, labeller="label_both")
If our data frame has changed, for example if we have recalculated the values of a variable, or have discarded some observations, then we will want to create a new plot for our modified data frame. Rather than write out all the plot commands again, we can update the existing plot by ‘adding’ the new data frame.
The symbol for adding a new data frame to a plot is slightly different from that for adding other plot features: %+%. (Commands enclosed in % % represent special, non-standard uses of a symbol or word. We won’t need to use these very often, so you can just remember that %+% means ‘change the data for a plot’.)
For example, if we discard some of the observations from the salience data set by subsetting the data frame, we can then add the subsetted data frame to the existing plot to see the result of the change.
salience_fast = subset(salience, RT<400)
fig_salience %+% salience_fast
It is important to remember that the new data frame replaces the original one, and is not drawn onto the plot additionally, as the + in %+% seems to suggest.
Perhaps we want to publish our plots in a journal or book that has specific style guidelines. For this, we will need to change more specific aspects of the plot’s appearance, such as the presence of gridlines, the placement of the legend, and so on. The theme() function makes changes to the appearance of a plot. This function has a great many different arguments, each of which controls one minor stylistic detail. They are too many to go into here, but you can look them up in the documentation to the theme() function should you ever need them.
A few ‘prepackaged’ theme functions are provided, which make multiple changes to a plot according to a particular style. For example, the ‘classic’ theme does not have guiding gridlines or a shaded background.
fig_smoking + theme_classic()
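For individual adjustments, we add a call to theme() with just the arguments we want to change. The sketch below covers the two details mentioned above, the gridlines and the placement of the legend. It uses the built-in mtcars data as a stand-in, and the particular settings are arbitrary choices made for illustration.

```r
library(ggplot2)

# Two stylistic changes collected in one theme object.
my_theme = theme(
  legend.position="bottom",            # move the legend below the plot
  panel.grid.minor=element_blank()     # remove the minor gridlines
)

fig_theme = ggplot(mtcars, aes(y=mpg, x=wt, color=factor(cyl))) +
  geom_point() +
  my_theme
print(fig_theme)
```

Storing the theme settings in an object, as here, makes it easy to apply the same style to several plots.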
Of course we also want to save our plots as images so we can put them in articles, websites, presentations, and posters. One way of saving a plot is via the Export button in RStudio’s Plots tab. This is good for saving a ‘one-off’ plot that we aren’t likely to want to come back to and modify. But if we want to make the creation of the image reproducible, then we should include it as a command in our R script or markdown file. This allows us to come back and just run our entire analysis again, maybe with new data, and get the new plot image automatically.
The ggsave() function saves a plot to an image file. The first input is the name we want to give the file, and the second input is the plot object. We use the suffix of the filename to specify what image format we want.
PNG (Portable Network Graphics) is a good multi-purpose image format that usually does not result in a huge file size, but preserves a decent image quality. It is good for displaying on a website.
Here we save one of the plots already created above as a png file:
ggsave("example_figure.png", fig_smoking)
## Saving 7 x 5 in image
A message informs us of the function’s default behavior. In this case it refers to the dimensions of the image. If we want to change the width and height of the image, we have to specify the width and height arguments.
The units of the image dimensions are by default inches (abbreviated to ‘in’ in the message above). Small values in the range of 2 or 3 will give a small, chunky image in which the text and objects are large relative to the overall plot size. Values much larger than 10 will give a more sparse-looking plot, in which text and points are relatively small. You will often need to experiment a bit until you find a size that looks good.
ggsave("example_figure.png", fig_smoking, width=5, height=3)
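If we prefer to think in centimeters, or need a particular resolution for print, ggsave() also accepts units and dpi arguments. The following is a minimal sketch. The plot is a stand-in made from the built-in mtcars data, the file is written to a temporary directory, and the file name and the dimensions are arbitrary choices for this example.

```r
library(ggplot2)

fig = ggplot(mtcars, aes(y=mpg, x=wt)) +
  geom_point()

# Write to a temporary directory so the example does not clutter the working directory.
out_file = file.path(tempdir(), "example_figure_cm.png")

# 12 x 8 cm at 300 dots per inch
ggsave(out_file, fig, width=12, height=8, units="cm", dpi=300)
```

Because the width, height and units are all given explicitly, the image should come out the same size however large the Plots window happens to be.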
If we are going to display our image in a very large size, for example on a poster or a big screen, then a scalable format such as SVG (Scalable Vector Graphics) is best. Instead of being stored as pixels, which will get fuzzy when the image is scaled up to a large size, an svg image is stored as a description of lines and shapes, so it stays crisp at whatever size it is scaled up to.
ggsave("example_figure.svg", fig_smoking, width=5, height=3)
(In order to create svg files on your own computer you may need to first install the R package svglite.)